This notebook is one in a series of many, where we explore how different data analysis strategies affect the outcome of a proteomics experiment based on isobaric labeling and mass spectrometry. Each analysis strategy or ‘workflow’ can be divided up into different components; it is recommend you read more about that in the introduction notebook.
In this notebook specifically, we investigate the effect of varying the Summarization component on the outcome of the differential expression results. The three component variants are: Median summarization, iPQF, Sum summarization.
The R packages and helper scripts necessary to run this notebook are listed in the next code chunk: click the ‘Code’ button. Each code section can be expanded in a similar fashion. You can also download the entire notebook source code.
library(stringi)
library(gridExtra)
library(dendextend)
library(kableExtra)
library(limma)
library(psych)
library(MSnbase) # CAREFUL! load this BEFORE tidyverse, or you will screw up the rename function.
library(tidyverse)
source('other_functions.R')
source('plotting_functions.R')
Let’s load our PSM-level data set:
data.list <- readRDS(params$input_data_p)
dat.l <- data.list$dat.l # data in long format
display_dataframe_head(dat.l)
| Mixture | TechRepMixture | Condition | BioReplicate | Run | Channel | Protein | Peptide | RT | Charge | PTM | intensity | TotalIntensity | Ions.Score | DeltaMZ | isoInterOk | noNAs | onehit.protein | shared.peptide |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mixture2 | 1 | 1 | Mixture2_1 | Mixture2_1 | 127C | P21291 | [K].gFGFGQGAGALVHSE.[-] | 144.9444 | 2 | N-Term(TMT6plex) | 44048.41 | 394657.6 | 65 | 0.00052 | TRUE | TRUE | FALSE | FALSE |
| Mixture2 | 1 | 0.5 | Mixture2_0.5 | Mixture2_1 | 127N | P21291 | [K].gFGFGQGAGALVHSE.[-] | 144.9444 | 2 | N-Term(TMT6plex) | 38298.89 | 394657.6 | 65 | 0.00052 | TRUE | TRUE | FALSE | FALSE |
| Mixture2 | 1 | 0.125 | Mixture2_0.125 | Mixture2_1 | 128C | P21291 | [K].gFGFGQGAGALVHSE.[-] | 144.9444 | 2 | N-Term(TMT6plex) | 39552.81 | 394657.6 | 65 | 0.00052 | TRUE | TRUE | FALSE | FALSE |
| Mixture2 | 1 | 0.667 | Mixture2_0.667 | Mixture2_1 | 128N | P21291 | [K].gFGFGQGAGALVHSE.[-] | 144.9444 | 2 | N-Term(TMT6plex) | 56730.70 | 394657.6 | 65 | 0.00052 | TRUE | TRUE | FALSE | FALSE |
| Mixture2 | 1 | 0.667 | Mixture2_0.667 | Mixture2_1 | 129C | P21291 | [K].gFGFGQGAGALVHSE.[-] | 144.9444 | 2 | N-Term(TMT6plex) | 48744.49 | 394657.6 | 65 | 0.00052 | TRUE | TRUE | FALSE | FALSE |
| Mixture2 | 1 | 1 | Mixture2_1 | Mixture2_1 | 129N | P21291 | [K].gFGFGQGAGALVHSE.[-] | 144.9444 | 2 | N-Term(TMT6plex) | 60179.96 | 394657.6 | 65 | 0.00052 | TRUE | TRUE | FALSE | FALSE |
After the filtering done in data_prep.R, there are 19 UPS1 proteins remaining, even though 48 were originally spiked in.
# which proteins were spiked in?
spiked.proteins <- dat.l %>% distinct(Protein) %>% filter(stri_detect(Protein, fixed='ups')) %>% pull %>% as.character
remove_factors(spiked.proteins)
## [1] "P02787ups" "P02788ups" "P01133ups" "P69905ups" "P01008ups"
## [6] "P02741ups" "P01375ups" "P10636-8ups" "O00762ups" "P62988ups"
## [11] "P10145ups" "P05413ups" "P00915ups" "P02753ups" "P02144ups"
## [16] "P01344ups" "P01579ups" "P00709ups" "P01127ups"
tmp=dat.l %>% distinct(Protein) %>% pull %>% as.character
# protein subsampling
if (params$subsample_p>0 & params$subsample_p==floor(params$subsample_p) & params$subsample_p<=length(tmp)){
sub.prot <- tmp[sample(1:length(tmp), size=params$subsample_p)]
if (length(spiked.proteins)>0) sub.prot <- c(sub.prot,spiked.proteins)
dat.l <- dat.l %>% filter(Protein %in% sub.prot)
}
We store the metadata in sample.info and show some entries below. We also pick technical replicates with a dilution factor of 0.5 as the reference condition of interest. Each condition is represented by two of eight reporter Channels in each Run.
display_dataframe_head(sample.info)
| Run | Run.short | Channel | Sample | Sample.short | Condition | Color |
|---|---|---|---|---|---|---|
| Mixture1_1 | Mix1_1 | 127C | Mixture1_1:127C | Mix1_1:127C | 0.125 | black |
| Mixture1_1 | Mix1_1 | 127N | Mixture1_1:127N | Mix1_1:127N | 0.667 | green |
| Mixture1_1 | Mix1_1 | 128C | Mixture1_1:128C | Mix1_1:128C | 1 | red |
| Mixture1_1 | Mix1_1 | 128N | Mixture1_1:128N | Mix1_1:128N | 0.5 | blue |
| Mixture1_1 | Mix1_1 | 129C | Mixture1_1:129C | Mix1_1:129C | 0.5 | blue |
| Mixture1_1 | Mix1_1 | 129N | Mixture1_1:129N | Mix1_1:129N | 0.125 | black |
referenceCondition
## [1] "0.5"
channelNames
## [1] "127C" "127N" "128C" "128N" "129C" "129N" "130C" "130N"
We use the default unit scale: the log2-transformed reportion ion intensities.
dat.unit.l <- dat.l %>% mutate(response=log2(intensity)) %>% select(-intensity)
display_dataframe_head(dat.unit.l)
| Mixture | TechRepMixture | Condition | BioReplicate | Run | Channel | Protein | Peptide | RT | Charge | PTM | TotalIntensity | Ions.Score | DeltaMZ | isoInterOk | noNAs | onehit.protein | shared.peptide | response |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mixture2 | 1 | 1 | Mixture2_1 | Mixture2_1 | 127C | P21291 | [K].gFGFGQGAGALVHSE.[-] | 144.9444 | 2 | N-Term(TMT6plex) | 394657.6 | 65 | 0.00052 | TRUE | TRUE | FALSE | FALSE | 15.42680 |
| Mixture2 | 1 | 0.5 | Mixture2_0.5 | Mixture2_1 | 127N | P21291 | [K].gFGFGQGAGALVHSE.[-] | 144.9444 | 2 | N-Term(TMT6plex) | 394657.6 | 65 | 0.00052 | TRUE | TRUE | FALSE | FALSE | 15.22502 |
| Mixture2 | 1 | 0.125 | Mixture2_0.125 | Mixture2_1 | 128C | P21291 | [K].gFGFGQGAGALVHSE.[-] | 144.9444 | 2 | N-Term(TMT6plex) | 394657.6 | 65 | 0.00052 | TRUE | TRUE | FALSE | FALSE | 15.27149 |
| Mixture2 | 1 | 0.667 | Mixture2_0.667 | Mixture2_1 | 128N | P21291 | [K].gFGFGQGAGALVHSE.[-] | 144.9444 | 2 | N-Term(TMT6plex) | 394657.6 | 65 | 0.00052 | TRUE | TRUE | FALSE | FALSE | 15.79184 |
| Mixture2 | 1 | 0.667 | Mixture2_0.667 | Mixture2_1 | 129C | P21291 | [K].gFGFGQGAGALVHSE.[-] | 144.9444 | 2 | N-Term(TMT6plex) | 394657.6 | 65 | 0.00052 | TRUE | TRUE | FALSE | FALSE | 15.57295 |
| Mixture2 | 1 | 1 | Mixture2_1 | Mixture2_1 | 129N | P21291 | [K].gFGFGQGAGALVHSE.[-] | 144.9444 | 2 | N-Term(TMT6plex) | 394657.6 | 65 | 0.00052 | TRUE | TRUE | FALSE | FALSE | 15.87700 |
Median sweeping means subtracting from each PSM quantification value the spectrum median (i.e., the row median computed across samples/channels). If the unit scale is set to intensities or ratios, the multiplicative variant of this procedure is applied: subtraction is replaced by division. Since median sweeping needs to be applied on matrix-like data, let’s switch to wide format. (Actually, this is semi-wide, since the Channel columns still have contributions form all Runs, but that’s OK because in the next step we split by Run.)
# switch to wide format
dat.unit.w <- pivot_wider(data = dat.unit.l, id_cols=-one_of(c('Condition', 'BioReplicate')), names_from=Channel, values_from=response)
display_dataframe_head(dat.unit.w)
| Mixture | TechRepMixture | Run | Protein | Peptide | RT | Charge | PTM | TotalIntensity | Ions.Score | DeltaMZ | isoInterOk | noNAs | onehit.protein | shared.peptide | 127C | 127N | 128C | 128N | 129C | 129N | 130C | 130N |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mixture2 | 1 | Mixture2_1 | P21291 | [K].gFGFGQGAGALVHSE.[-] | 144.9444 | 2 | N-Term(TMT6plex) | 394657.56 | 65 | 0.00052 | TRUE | TRUE | FALSE | FALSE | 15.42680 | 15.22502 | 15.27149 | 15.79184 | 15.57295 | 15.87700 | 15.71883 | 15.69836 |
| Mixture2 | 1 | Mixture2_1 | P30511 | [K].wAAVVVPPGEEQR.[Y] | 137.1704 | 2 | N-Term(TMT6plex) | 180359.11 | 27 | -0.00653 | TRUE | TRUE | TRUE | FALSE | 14.36513 | 14.14964 | 14.27690 | 14.65920 | 14.44955 | 14.58652 | 14.54510 | 14.57900 |
| Mixture2 | 1 | Mixture2_1 | P02787ups | [K].sASDLTWDNLk.[G] | 148.8299 | 3 | N-Term(TMT6plex); K11(TMT6plex) | 507218.12 | 35 | -0.00005 | TRUE | TRUE | FALSE | FALSE | 16.31046 | 15.32833 | 15.11530 | 16.27180 | 15.92615 | 16.73355 | 15.74796 | 15.42268 |
| Mixture2 | 1 | Mixture2_1 | Q9UBQ5 | [K].iDFDSVSSIMASSQ.[-] | 201.8727 | 2 | N-Term(TMT6plex) | 120944.32 | 72 | 0.00119 | TRUE | TRUE | FALSE | FALSE | 13.51956 | 13.51995 | 13.66172 | 14.36018 | 13.87050 | 14.20738 | 13.86685 | 13.83673 |
| Mixture2 | 1 | Mixture2_1 | Q5H9R7 | [R].tGQPSAPGDTSVNGPV.[-] | 106.9002 | 2 | N-Term(TMT6plex) | 99098.46 | 23 | -0.00133 | TRUE | TRUE | FALSE | FALSE | 13.54513 | 13.17495 | 13.34133 | 13.81323 | 13.58510 | 14.02262 | 13.60511 | 13.51803 |
| Mixture2 | 1 | Mixture2_1 | P43490 | [K].nAQLNIELEAAHH.[-] | 99.8437 | 3 | N-Term(TMT6plex) | 497023.17 | 15 | 0.00061 | TRUE | TRUE | FALSE | FALSE | 15.58965 | 15.43193 | 15.86127 | 16.18651 | 15.86677 | 16.28384 | 15.93733 | 16.03418 |
Now let’s sweep the medians of all the rows. No need to split this per Run, because each row in this semi-wide format contains only values from one Run and each median calculation is independent of the other rows.
# subtract the spectrum median log2intensity from the observed log2intensities
dat.norm.w <- dat.unit.w
dat.norm.w[,channelNames] <- median_sweep(dat.norm.w[,channelNames], 1, '-')
display_dataframe_head(dat.norm.w)
| Mixture | TechRepMixture | Run | Protein | Peptide | RT | Charge | PTM | TotalIntensity | Ions.Score | DeltaMZ | isoInterOk | noNAs | onehit.protein | shared.peptide | 127C | 127N | 128C | 128N | 129C | 129N | 130C | 130N |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mixture2 | 1 | Mixture2_1 | P21291 | [K].gFGFGQGAGALVHSE.[-] | 144.9444 | 2 | N-Term(TMT6plex) | 394657.56 | 65 | 0.00052 | TRUE | TRUE | FALSE | FALSE | -0.2088531 | -0.4106403 | -0.3641628 | 0.1561868 | -0.0627039 | 0.2413401 | 0.0831727 | 0.0627039 |
| Mixture2 | 1 | Mixture2_1 | P30511 | [K].wAAVVVPPGEEQR.[Y] | 137.1704 | 2 | N-Term(TMT6plex) | 180359.11 | 27 | -0.00653 | TRUE | TRUE | TRUE | FALSE | -0.1321964 | -0.3476832 | -0.2204273 | 0.1618711 | -0.0477788 | 0.0891915 | 0.0477788 | 0.0816737 |
| Mixture2 | 1 | Mixture2_1 | P02787ups | [K].sASDLTWDNLk.[G] | 148.8299 | 3 | N-Term(TMT6plex); K11(TMT6plex) | 507218.12 | 35 | -0.00005 | TRUE | TRUE | FALSE | FALSE | 0.4734048 | -0.5087304 | -0.7217574 | 0.4347442 | 0.0890943 | 0.8964941 | -0.0890943 | -0.4143805 |
| Mixture2 | 1 | Mixture2_1 | Q9UBQ5 | [K].iDFDSVSSIMASSQ.[-] | 201.8727 | 2 | N-Term(TMT6plex) | 120944.32 | 72 | 0.00119 | TRUE | TRUE | FALSE | FALSE | -0.3322250 | -0.3318356 | -0.1900625 | 0.5083959 | 0.0187098 | 0.3555902 | 0.0150596 | -0.0150596 |
| Mixture2 | 1 | Mixture2_1 | Q5H9R7 | [R].tGQPSAPGDTSVNGPV.[-] | 106.9002 | 2 | N-Term(TMT6plex) | 99098.46 | 23 | -0.00133 | TRUE | TRUE | FALSE | FALSE | -0.0199852 | -0.3901617 | -0.2237788 | 0.2481239 | 0.0199852 | 0.4575068 | 0.0399986 | -0.0470807 |
| Mixture2 | 1 | Mixture2_1 | P43490 | [K].nAQLNIELEAAHH.[-] | 99.8437 | 3 | N-Term(TMT6plex) | 497023.17 | 15 | 0.00061 | TRUE | TRUE | FALSE | FALSE | -0.3124028 | -0.4701170 | -0.0407846 | 0.2844554 | -0.0352769 | 0.3817914 | 0.0352769 | 0.1321293 |
These (partially) normalized quantification values are now already comparable, but after summarization we will also sweep the columns on the protein level, as suggested by Herbrich at al..
In the next three subsections, let’s look at our different ways to summarize quantification values from PSM to peptide (first step) to protein (second step) in each sample.
dat.norm.summ.w <- emptyList(variant.names)
dat.nonnorm.summ.w <- emptyList(variant.names)
Median summarization is simple: within each Run and within each Channel, we replace multiple related observations with their median. First, for each Peptide (median of the PSM values), then for each Protein (median of the peptide values).
median_summarization <- function(dat) {
# group by (run,)protein,peptide then summarize twice (once on each level)
# add select() statement because summarise_at is going bananas over character columns
return(dat %>% group_by(Run, Protein, Peptide) %>% select(Run, Protein, Peptide, channelNames) %>% summarise_at(.vars = channelNames, .funs = median) %>% select(Run, Protein, channelNames) %>% summarise_at(.vars = channelNames, .funs = median) %>% ungroup())
}
# normalized data
dat.norm.summ.w$median <- median_summarization(dat.norm.w)
display_dataframe_head(dat.norm.summ.w$median[, channelNames])
| 127C | 127N | 128C | 128N | 129C | 129N | 130C | 130N |
|---|---|---|---|---|---|---|---|
| -0.4092298 | -0.1673011 | 0.2141415 | 0.3318794 | 0.2110937 | -0.0030274 | 0.0030274 | -0.4648726 |
| -0.5248261 | -0.0762095 | 0.3164919 | 0.1996797 | 0.0311662 | 0.0681443 | 0.0185818 | -0.6414210 |
| -0.4006406 | -0.2139570 | 0.3876424 | 0.3337212 | 0.2583116 | -0.0188944 | 0.0188944 | -0.4659382 |
| -0.2285656 | -0.1436318 | 0.0574925 | 0.3790593 | 0.3363888 | 0.2180016 | -0.0574925 | -0.3891581 |
| 0.1427990 | -0.1600185 | 0.7483754 | 0.1524052 | -0.4018839 | -0.0658215 | 0.0658215 | -0.9682449 |
| -0.2304093 | -0.3565971 | 0.1488981 | 0.3135104 | 0.0583539 | -0.0485547 | 0.0485547 | -0.5092703 |
Let’s also summarize the non-normalized data for comparison later on.
# non-normalized data
dat.nonnorm.summ.w$median <- median_summarization(dat.unit.w)
iPQF uses several quantitative and qualitative peptide spectral characteristics to compute a weighted mean to approximate protein abundance. This means it can summarize directly from PSM to protein level while taking into account “features such as charge state, sequence length, identification score, mass and a distance metric within uniquely and redundantly measured spectra”. As iPQF is easiest to use through MSnbase, we first turn each MS run in our data set into an MSnSet object and then use the combineFeatures function with the iPQF option.
iPQF_summarization <- function(x) {
dat <- split(x, x$Run) # apply iPQF to each Run separately
# first make an MSnSet object
exprs <- lapply(dat, function(y) as.matrix(y[,channelNames]))
fdata <- lapply(dat, function(y) as.data.frame(y %>% select(-channelNames)) %>% rename(sequence='Peptide', accession='Protein', charge='Charge', modifications='PTM', mass_to_charge='DeltaMZ', search_engine_score='Ions.Score', Intensity='TotalIntensity'))
mss <- emptyList(names(dat))
for (i in seq_along(mss)) { mss[[i]] <- MSnSet(exprs = exprs[[i]], fData = fdata[[i]]) }
# then summarize
mss.norm <- lapply(mss, function(y) combineFeatures(y, method='iPQF', groupBy = fData(y)$accession, ratio.calc='none'))
mss.norm.tibble <- lapply(mss.norm, function(y) tibble(cbind(data.frame(Run=fData(y)$Run), rownames_to_column(as.data.frame(exprs(y)), var='Protein'))))
return(bind_rows(mss.norm.tibble))
}
# normalized data
dat.norm.summ.w$iPQF <- iPQF_summarization(dat.norm.w)
display_dataframe_head(dat.norm.summ.w$iPQF[, channelNames])
| 127C | 127N | 128C | 128N | 129C | 129N | 130C | 130N |
|---|---|---|---|---|---|---|---|
| -0.4341550 | -0.1700689 | 0.2599970 | 0.3723777 | 0.2029793 | -0.0005341 | 0.0003044 | -0.4528157 |
| -0.5577097 | -0.0756276 | 0.3076754 | 0.1813082 | 0.0238412 | 0.0926545 | 0.0006281 | -0.6809088 |
| -0.4006406 | -0.2139570 | 0.3876424 | 0.3337212 | 0.2583116 | -0.0188944 | 0.0188944 | -0.4659382 |
| -0.2285656 | -0.1436318 | 0.0574925 | 0.3790593 | 0.3363888 | 0.2180016 | -0.0574925 | -0.3891581 |
| 0.1427990 | -0.1600185 | 0.7483754 | 0.1524052 | -0.4018839 | -0.0658215 | 0.0658215 | -0.9682449 |
| -0.2304093 | -0.3565971 | 0.1488981 | 0.3135104 | 0.0583539 | -0.0485547 | 0.0485547 | -0.5092703 |
Here, too, let’s summarize the non-normalized data for comparison later on.
# non-normalized data
dat.nonnorm.summ.w$iPQF <- iPQF_summarization(dat.unit.w)
Sum summarization is completely analogous to the Median summarization, except that we sum values instead of taking the median. Note that sum normalization is not equivalent to mean normalization: yes, rows containing NA values are removed, but there may be multiple PSMs per peptide and multiple peptides per protein. Since we know that there is a strong peptide-run interaction, summing values across peptides per protein may result in strong bias by run.
sum_summarization <- function(dat) {
# group by (run,)protein,peptide then summarize twice (once on each level)
# add select() statement because summarise_at is going bananas over character columns
return(dat %>% group_by(Run, Protein, Peptide) %>% select(Run, Protein, Peptide, channelNames) %>% summarise_at(.vars = channelNames, .funs = sum) %>% select(Run, Protein, channelNames) %>% summarise_at(.vars = channelNames, .funs = sum) %>% ungroup())
}
# normalized data
dat.norm.summ.w$sum <- sum_summarization(dat.norm.w)
display_dataframe_head(dat.norm.summ.w$sum[, channelNames])
| 127C | 127N | 128C | 128N | 129C | 129N | 130C | 130N |
|---|---|---|---|---|---|---|---|
| -3.3745770 | -1.2512419 | 1.6393313 | 2.5813761 | 1.2186853 | -0.0528255 | 0.0813142 | -3.4513748 |
| -2.2370267 | -0.4144577 | 1.1292273 | 0.6625648 | 0.1962396 | 0.2949377 | 0.0176503 | -2.6657243 |
| -0.8012813 | -0.4279141 | 0.7752849 | 0.6674423 | 0.5166232 | -0.0377888 | 0.0377888 | -0.9318764 |
| -0.2285656 | -0.1436318 | 0.0574925 | 0.3790593 | 0.3363888 | 0.2180016 | -0.0574925 | -0.3891581 |
| 0.1427990 | -0.1600185 | 0.7483754 | 0.1524052 | -0.4018839 | -0.0658215 | 0.0658215 | -0.9682449 |
| -0.2304093 | -0.3565971 | 0.1488981 | 0.3135104 | 0.0583539 | -0.0485547 | 0.0485547 | -0.5092703 |
Again, let’s also summarize the non-normalized data for comparison later on.
# non-normalized data
dat.nonnorm.summ.w$sum <- sum_summarization(dat.unit.w)
Now that the data is on the protein level, let’s sweep all values separately per protein in the columns/samples. This is slightly different from sweeping before the summarization step because the median of medians is not the same as the grand median, but this does not introduce any bias.
# medianSweeping: in each channel, subtract median computed across all proteins within the channel
# do the above separately for each MS run
dat.norm.summ.w <- lapply(dat.norm.summ.w, function(x) {
x.split <- split(x, x$Run)
x.split.norm <- lapply(x.split, function(y) {
y[,channelNames] <- median_sweep(y[,channelNames], 2, '-')
return(y)})
dat.norm.summ.w <- bind_rows(x.split.norm)
})
Before getting to the DEA section, let’s do some basic quality control and take a sneak peek at the differences between the component variants we’ve chosen. First, however, we should make the data completely wide, so that each sample gets it’s own unique column.
# make data completely wide (also across runs)
## normalized data
dat.norm.summ.w2 <- lapply(dat.norm.summ.w, function(x) {
return(x %>% pivot_wider(names_from = Run, values_from = all_of(channelNames), names_glue = "{Run}:{.value}"))
})
colnames(dat.norm.summ.w2$median)
## [1] "Protein" "Mixture1_1:127C" "Mixture1_2:127C" "Mixture2_1:127C"
## [5] "Mixture2_2:127C" "Mixture1_1:127N" "Mixture1_2:127N" "Mixture2_1:127N"
## [9] "Mixture2_2:127N" "Mixture1_1:128C" "Mixture1_2:128C" "Mixture2_1:128C"
## [13] "Mixture2_2:128C" "Mixture1_1:128N" "Mixture1_2:128N" "Mixture2_1:128N"
## [17] "Mixture2_2:128N" "Mixture1_1:129C" "Mixture1_2:129C" "Mixture2_1:129C"
## [21] "Mixture2_2:129C" "Mixture1_1:129N" "Mixture1_2:129N" "Mixture2_1:129N"
## [25] "Mixture2_2:129N" "Mixture1_1:130C" "Mixture1_2:130C" "Mixture2_1:130C"
## [29] "Mixture2_2:130C" "Mixture1_1:130N" "Mixture1_2:130N" "Mixture2_1:130N"
## [33] "Mixture2_2:130N"
## non-normalized data
dat.nonnorm.summ.w2 <- lapply(dat.nonnorm.summ.w, function(x){
return(x %>% pivot_wider(names_from = Run, values_from = all_of(channelNames), names_glue = "{Run}:{.value}") )
})
These boxplots of both the raw and normalized intensities show that the distributions of Median- and iPQF-summarized values are very similar and symmetrical. In contrast, the Sum summarization produces very skewed distributions. Although for all summarization methods the boxplots are centered after normalization, this skewness of the Sum summarized values is ominous.
# use (half-)wide format
par(mfrow=c(1,2))
for (i in 1:n.comp.variants){
boxplot_w(dat.nonnorm.summ.w[[variant.names[i]]],sample.info, paste('raw', variant.names[i], sep='_'))
boxplot_w(dat.norm.summ.w[[variant.names[i]]], sample.info, paste('normalized', variant.names[i], sep='_'))}
We then make MA plots of two single samples taken from condition 0.5 and condition 0.125, measured in different MS runs (samples Mixture2_1:127C and Mixture1_2:129N, respectively). Clearly, the normalization had a strong variance-reducing effect on the fold changes. It seems that fold changes associated with Sum summarization experience a strong bias (blue curve is rolling average) for low abundance proteins after normalization.
for (i in 1:n.comp.variants){
p1 <- maplot_ils(dat.nonnorm.summ.w2[[variant.names[i]]], ma.onesample.num, ma.onesample.denom, scale.vec, paste('raw', variant.names[i], sep='_'), spiked.proteins)
p2 <- maplot_ils(dat.norm.summ.w2[[variant.names[i]]], ma.onesample.num, ma.onesample.denom, scale.vec, paste('normalized', variant.names[i], sep='_'), spiked.proteins)
grid.arrange(p1, p2, ncol=2)}
To increase the robustness of these results, let’s make some more MA plots, but now for all samples from condition 0.5 and condition 0.125 (quantification values averaged within condition). Indeed, even the raw, unnormalized data now show less variability, and again even more so for the Median summarization normalized data. It seems that by using more samples (now 8 in both the enumerator and denominator instead of just one) in the fold change calculation the rolling average is more robust and the Sum summarization data bias has been reduced (but not disappeared).
channels.num <- sample.info %>% filter(Condition==ma.allsamples.num) %>% select(Sample) %>% pull
channels.denom <- sample.info %>% filter(Condition==ma.allsamples.denom) %>% select(Sample) %>% pull
for (i in 1:n.comp.variants){
p1 <- maplot_ils(dat.nonnorm.summ.w2[[variant.names[i]]], channels.num, channels.denom, scale=scale.vec, paste('raw', variant.names[i], sep='_'), spiked.proteins)
p2 <- maplot_ils(dat.norm.summ.w2[[variant.names[i]]], channels.num, channels.denom, scale=scale.vec, paste('normalized', variant.names[i], sep='_'), spiked.proteins)
grid.arrange(p1, p2, ncol=2)}
Now, let’s check if these multi-dimensional data contains some kind of grouping; It’s time to make PCA plots. Even though PC1 does seem to capture the conditions, providing a gradient for the dilution number, only the 0.125 condition is completely separable in the normalized data. Here, clearly the Sum summarization is insufficient and does not get further than merely scrambling the samples and not leaving them clustered according to run.
par(mfrow=c(1,2))
for (i in 1:n.comp.variants){
pcaplot_ils(dat.nonnorm.summ.w2[[variant.names[i]]] %>% select(-'Protein'), info=sample.info, paste('raw', variant.names[i], sep='_'))
pcaplot_ils(dat.norm.summ.w2[[variant.names[i]]] %>% select(-'Protein'), info=sample.info, paste('normalized', variant.names[i], sep='_'))}
There are only 19 proteins supposed to be differentially expressed in this data set, which is only a very small amount in both relative (to the 4083 proteins total) and absolute (for a biological sample) terms.
Therefore, let’s see what the PCA plots look like if we were to only use the spiked proteins in the PCA. Now, there are clear differences between the raw or non-normalized Median and iPQF) plots, but after normalization they are very similar! This time, the separation between different conditions has become more distinct, which suggests the experiment was carried out successfully. Even here, Sum summarization does not properly separate the samples according to condition.
par(mfrow=c(1,2))
for (i in 1:n.comp.variants){
pcaplot_ils(dat.nonnorm.summ.w2[[variant.names[i]]] %>% filter(Protein %in% spiked.proteins) %>% select(-'Protein'), info=sample.info, paste('raw', variant.names[i], sep='_'))
pcaplot_ils(dat.norm.summ.w2[[variant.names[i]]] %>% filter(Protein %in% spiked.proteins) %>% select(-'Protein'), info=sample.info, paste('normalized', variant.names[i], sep='_'))}
Notice how for all PCA plots, the percentage of variance explained by PC1 is now much greater than when using data from all proteins. In a real situation without spiked proteins, you might plot data corresponding to the top X most differential proteins instead.
The PCA plots of all proteins has a rather lower fraction of variance explained by PC1. We can confirm this using the hierarchical clustering dendrograms below: when considering the entire multidimensional space, the different conditions are not very separable at all. This is not surprising as there is little biological variation between the conditions: there are only 19 truly differential proteins, and they all (ought to) covary in exactly the same manner (i.e., their variation can be captured in one dimension).
par(mfrow=c(1,2))
for (i in 1:n.comp.variants){
dendrogram_ils(dat.nonnorm.summ.w2[[variant.names[i]]] %>% select(-Protein), info=sample.info, paste('raw', variant.names[i], sep='_'))
dendrogram_ils(dat.norm.summ.w2[[variant.names[i]]] %>% select(-Protein), info=sample.info, paste('normalized', variant.names[i], sep='_'))}
Our last quality check involves a measure of how well each variant was able to assist in removing the run effect. Below are the distributions of p-values from a linear model for the response variable with Run as a covariate. If the run effect was removed successfully, these p-values ought to be large. Clearly, the raw data contains a run effect, which is partially removed by Median and iPQF summarization and - surprisingly - even better by Sum summarization. The latter may be a consequence of the fact that this experiment was carried out very carefully: the sample sizes are near-identical and the simulated variance we added averages out across multiple samples, rendering the total summed intensities of proteins more or less stable across runs.
plots <- vector('list', n.comp.variants)
for (i in 1:n.comp.variants){
dat <- list(dat.nonnorm.summ.l[[variant.names[i]]], dat.norm.summ.l[[variant.names[i]]])
names(dat) <- c(paste('raw', variant.names[i], sep='_'), paste('normalized', variant.names[i], sep='_'))
plots[[i]] <- run_effect_plot(dat)}
grid.arrange(grobs = plots, nrow=n.comp.variants)
We look at the log2 fold changes of each condition w.r.t. the reference condition with dilution ratio 0.125. Since we are working with a log2 unit scale already, this means that for each protein we just look at the difference in mean observation across all channels between one condition and the reference condition. Note that this is not the same as looking at the log2 of the ratio of mean raw intensities for each condition (left hand side below), nor the mean ratio of raw intensities for each condition (right hand side below), since \(log_2 (\frac{mean(B)}{mean(A)}) \neq \frac{mean(log_2 (B))}{mean(log_2 (A))} \neq mean(\frac{log_2 (B)}{log_2 (A)})\).
To check whether these fold changes are significant (criterium: \(q<0.05\)), we use a moderated t-test slightly adapted from the limma package, which in use cases like ours should improve statistical power over a regular t-test. In a nutshell, this is a t-test done independently for each protein, although the variance used in the calculation of the t-statistic is moderated using some empirical Bayes estimation. After testing, we make a correction for multiple testing using the Benjamini-Hochberg method in order to keep the FDR under control.
# design matrix as used in ANOVA testing.
design.matrix <- get_design_matrix(referenceCondition, sample.info)
dat.dea <- emptyList(names(dat.norm.summ.w2))
for (i in 1:n.comp.variants){
# provide scale so moderated_ttest knows whether you input log2 or raw intensities.
this_scale <- scale.vec
d <- column_to_rownames(as.data.frame(dat.norm.summ.w2[[variant.names[i]]]), 'Protein')
dat.dea[[variant.names[i]]] <- moderated_ttest(dat=d, design.matrix, scale=this_scale)}
For each condition, we now get the fold changes, moderated and unmoderated p-values, moderated and unmoderated q-values (BH-adjusted p-values), and some other details (head of dataframe below).
display_dataframe_head(dat.dea[[1]])
| logFC_0.125 | logFC_0.667 | logFC_1 | t.ord_0.125 | t.ord_0.667 | t.ord_1 | t.mod_0.125 | t.mod_0.667 | t.mod_1 | p.ord_0.125 | p.ord_0.667 | p.ord_1 | p.mod_0.125 | p.mod_0.667 | p.mod_1 | q.ord_0.125 | q.ord_0.667 | q.ord_1 | q.mod_0.125 | q.mod_0.667 | q.mod_1 | df.r | df.0 | s2.0 | s2 | s2.post | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A0AVT1 | -0.0296332 | -0.0190075 | 0.0086228 | -1.2236300 | -0.7848676 | 0.3560550 | -1.1698892 | -0.7503969 | 0.3404173 | 0.2312896 | 0.4391218 | 0.7244682 | 0.2512504 | 0.4588575 | 0.7359131 | 0.9569613 | 0.9960123 | 0.9898622 | 0.9350748 | 0.9805978 | 0.9840728 | 28 | 2.018963 | 0.0056242 | 0.0023459 | 0.0025664 |
| A0FGR8 | 0.0121287 | 0.0389655 | 0.0247813 | 0.3627376 | 1.1653597 | 0.7411452 | 0.3596350 | 1.1553923 | 0.7348061 | 0.7195248 | 0.2537054 | 0.4647764 | 0.7216378 | 0.2570437 | 0.4681594 | 0.9841943 | 0.9960123 | 0.9820911 | 0.9795638 | 0.9723943 | 0.9650451 | 28 | 2.018963 | 0.0056242 | 0.0044720 | 0.0045495 |
| A0MZ66 | 0.0035130 | -0.0009938 | -0.0183834 | 0.0761086 | -0.0215311 | -0.3982773 | 0.0769941 | -0.0217816 | -0.4029113 | 0.9398739 | 0.9829747 | 0.6934465 | 0.9391391 | 0.9827663 | 0.6898687 | 0.9978394 | 0.9975071 | 0.9898622 | 0.9964971 | 0.9975410 | 0.9825206 | 28 | 2.018963 | 0.0056242 | 0.0085220 | 0.0083271 |
| A1L0T0 | 0.0520381 | -0.0328378 | -0.0500166 | 0.7951963 | -0.5017969 | -0.7643056 | 0.8165216 | -0.5152539 | -0.7848025 | 0.4358354 | 0.6212896 | 0.4536089 | 0.4229483 | 0.6115101 | 0.4409319 | 0.9569613 | 0.9975071 | 0.9820911 | 0.9350748 | 0.9924981 | 0.9650451 | 20 | 2.018963 | 0.0056242 | 0.0128474 | 0.0121851 |
| A3KMH1 | 0.4342224 | 0.3206653 | 0.5484563 | 1.7510253 | 1.2931002 | 2.2116798 | 1.8344616 | 1.3547163 | 2.3170662 | 0.0952654 | 0.2107143 | 0.0387933 | 0.0801348 | 0.1892462 | 0.0301887 | 0.9569613 | 0.9960123 | 0.7960743 | 0.9350748 | 0.9540539 | 0.7680706 | 20 | 2.018963 | 0.0056242 | 0.1844849 | 0.1680848 |
| A4D1E9 | -0.0219254 | -0.0731552 | -0.0426655 | -0.4771987 | -1.5921945 | -0.9285977 | -0.4796669 | -1.6004298 | -0.9334006 | 0.6383938 | 0.1270236 | 0.3641701 | 0.6361944 | 0.1237544 | 0.3607366 | 0.9741019 | 0.9841938 | 0.9820911 | 0.9644451 | 0.9260254 | 0.9650451 | 20 | 2.018963 | 0.0056242 | 0.0063331 | 0.0062681 |
Now, the most important part: let’s find out how our component variants have affected the outcome of the DEA.
A confusion matrix shows how many true and false positives/negatives each variant has given rise to. Spiked proteins that are DE are true positives, background proteins that are not DE are true negatives. We calculate this matrix for all conditions and then calculate some other informative metrics based on the confusion matrices: accuracy, sensitivity, specificity, positive predictive value and negative predictive value.
Clearly, across the board, Sum summarization underperforms while Median summarization and iPQF are on equal footing, with an acceptable but not spectacular sensitivity.
cm <- conf_mat(dat.dea, 'q.mod', 0.05, spiked.proteins)
print_conf_mat(cm, referenceCondition)
| background | spiked | background | spiked | background | spiked | |
|---|---|---|---|---|---|---|
| not_DE | 4061 | 4 | 4061 | 4 | 4063 | 15 |
| DE | 3 | 15 | 3 | 15 | 1 | 4 |
| median | iPQF | sum | |
|---|---|---|---|
| Accuracy | 0.998 | 0.998 | 0.996 |
| Sensitivity | 0.789 | 0.789 | 0.211 |
| Specificity | 0.999 | 0.999 | 1.000 |
| PPV | 0.833 | 0.833 | 0.800 |
| NPV | 0.999 | 0.999 | 0.996 |
| background | spiked | background | spiked | background | spiked | |
|---|---|---|---|---|---|---|
| not_DE | 4064 | 18 | 4064 | 18 | 4064 | 19 |
| DE | 0 | 1 | 0 | 1 | 0 | 0 |
| median | iPQF | sum | |
|---|---|---|---|
| Accuracy | 0.996 | 0.996 | 0.995 |
| Sensitivity | 0.053 | 0.053 | 0.000 |
| Specificity | 1.000 | 1.000 | 1.000 |
| PPV | 1.000 | 1.000 | NaN |
| NPV | 0.996 | 0.996 | 0.995 |
| background | spiked | background | spiked | background | spiked | |
|---|---|---|---|---|---|---|
| not_DE | 4060 | 5 | 4061 | 5 | 4063 | 15 |
| DE | 4 | 14 | 3 | 14 | 1 | 4 |
| median | iPQF | sum | |
|---|---|---|---|
| Accuracy | 0.998 | 0.998 | 0.996 |
| Sensitivity | 0.737 | 0.737 | 0.211 |
| Specificity | 0.999 | 0.999 | 1.000 |
| PPV | 0.778 | 0.824 | 0.800 |
| NPV | 0.999 | 0.999 | 0.996 |
To see whether the three Summarization methods produce similar results on the detailed level of individual proteins, we make scatter plots and check the correlation between their fold changes and between their significance estimates (q-values, in our case).
For all conditions, we see the iPQF correlates well (\(>0.9\)) with the Median summarization, especially for the spike-in proteins with low q-values. Towards \(q=1\) the correlation is worse, but that is not surprising as that is the domain of housekeeping proteins, which are still subject to stochasticity and moreover not of particular interest. The Sum summarization marked everything as very insignificant, except for but a handful of proteins.
scatterplot_ils(dat.dea, significance.cols, 'q-values', spiked.proteins, referenceCondition)
The correlation metween Median summarization and iPQF is even higher for the fold changes: \(>0.99\) for all conditions. This is to be expected based on the q-value plots, as a large difference in the test statistic can be due to a small difference in fold change. The plots involving Sum summarization have an anomaly around 0 fold change, and the values of the spike-in proteins are not very well correlated with those of the other methods.
scatterplot_ils(dat.dea, logFC.cols, 'log2FC', spiked.proteins, referenceCondition)
The volcano plot combines information on fold changes and statistical significance. The spike-in proteins are colored blue, and immediately it is clear that their fold changes dominate the region of statistical significance, which suggests the experiment and analysis were carried out successfully. The magenta, dashed line indicates the theoretical fold change of the spike-ins.
for (i in 1:n.contrasts){
volcanoplot_ils(dat.dea, i, spiked.proteins, referenceCondition)}
A good way to assess the general trend of the fold change estimates on a more ‘macroscopic’ scale is to make a violin plot. Ideally, there will be some spike-in proteins that attain the expected fold change (red dashed line) that corresponds to their condition, while most (background) protein log2 fold changes are situated around zero.
Clearly, the empirical results tend towards the theoretical truth, but not a single observation attained the fold change it should have attained. There is clearly a strong bias towards zero fold change, which may partly be explained by the ratio compression phenomenon in mass spectrometry, although the effect seems quite extreme here. It seems that Median summarization and iPQF produce very similar violins, while Sum summarization is again the odd one out. Even though the Sum-associated values are closer to their theoretically expected values, in light of the rest of our analysis it seems more plausible that this is due to the entire distribution suffering increased variability, rather than that the Sum summarization would produce more reliable outcomes.
# plot theoretical value (horizontal lines) and violin per variant
violinplot_ils(lapply(dat.dea, function(x) x[spiked.proteins, logFC.cols]), referenceCondition)
save(dat.nonnorm.summ.l
,dat.norm.summ.l
,dat.nonnorm.summ.w
,dat.norm.summ.w
,dat.nonnorm.summ.w2
,dat.norm.summ.w2
,dat.dea, file=paste0('datadriven_summarization_outdata_', params$suffix_p, '.rda'))
For the given data set, the differences in proteomic outcomes between Median and iPQF normalization are quite small, both on the global and individual scale. Their differences with the Sum summarization outcomes are very large. The QC plots suggest that the Sum summarization obfuscates much of the work done by the normalization. This may be explained by the fact that sum summarization promotes not only abundant detections (at the PSM level) of proteins, but also repeated detections. Since repetition of detections in MS experiments is far from robust across runs, this is expected to introduce a great amount of stochasticity. Therefore, Sum summarization does not work when you have data from multiple runs. In such a case, it would be advised to use summarization method instead.
sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=de_BE.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=de_BE.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=de_BE.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=de_BE.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] forcats_0.5.0 stringr_1.4.0 dplyr_1.0.2
## [4] purrr_0.3.4 readr_1.4.0 tidyr_1.1.2
## [7] tibble_3.0.4 ggplot2_3.3.2 tidyverse_1.3.0
## [10] MSnbase_2.15.7 ProtGenerics_1.21.0 S4Vectors_0.27.14
## [13] mzR_2.23.1 Rcpp_1.0.5 Biobase_2.49.1
## [16] BiocGenerics_0.35.4 psych_2.0.9 limma_3.45.19
## [19] kableExtra_1.3.1 dendextend_1.14.0 gridExtra_2.3
## [22] stringi_1.5.3
##
## loaded via a namespace (and not attached):
## [1] colorspace_1.4-1 ellipsis_0.3.1 class_7.3-17
## [4] fs_1.5.0 rstudioapi_0.11 farver_2.0.3
## [7] affyio_1.59.0 prodlim_2019.11.13 fansi_0.4.1
## [10] lubridate_1.7.9 xml2_1.3.2 codetools_0.2-16
## [13] splines_4.0.3 ncdf4_1.17 mnormt_2.0.2
## [16] doParallel_1.0.16 impute_1.63.0 knitr_1.30
## [19] jsonlite_1.7.1 pROC_1.16.2 caret_6.0-86
## [22] broom_0.7.2 vsn_3.57.0 dbplyr_1.4.4
## [25] BiocManager_1.30.10 compiler_4.0.3 httr_1.4.2
## [28] backports_1.1.10 assertthat_0.2.1 Matrix_1.2-18
## [31] cli_2.1.0 htmltools_0.5.0 tools_4.0.3
## [34] gtable_0.3.0 glue_1.4.2 affy_1.67.1
## [37] reshape2_1.4.4 MALDIquant_1.19.3 cellranger_1.1.0
## [40] vctrs_0.3.4 preprocessCore_1.51.0 nlme_3.1-150
## [43] iterators_1.0.13 timeDate_3043.102 xfun_0.18
## [46] gower_0.2.2 rvest_0.3.6 lifecycle_0.2.0
## [49] XML_3.99-0.5 zlibbioc_1.35.0 MASS_7.3-53
## [52] scales_1.1.1 ipred_0.9-9 pcaMethods_1.81.0
## [55] hms_0.5.3 yaml_2.2.1 rpart_4.1-15
## [58] highr_0.8 foreach_1.5.1 e1071_1.7-4
## [61] BiocParallel_1.23.3 lava_1.6.8 rlang_0.4.8
## [64] pkgconfig_2.0.3 mzID_1.27.0 evaluate_0.14
## [67] lattice_0.20-41 recipes_0.1.14 labeling_0.4.2
## [70] tidyselect_1.1.0 plyr_1.8.6 magrittr_1.5
## [73] R6_2.4.1 IRanges_2.23.10 generics_0.0.2
## [76] DBI_1.1.0 pillar_1.4.6 haven_2.3.1
## [79] withr_2.3.0 mgcv_1.8-33 survival_3.2-7
## [82] nnet_7.3-14 modelr_0.1.8 crayon_1.3.4
## [85] tmvnsim_1.0-2 rmarkdown_2.5 viridis_0.5.1
## [88] grid_4.0.3 readxl_1.3.1 data.table_1.13.2
## [91] blob_1.2.1 ModelMetrics_1.2.2.2 reprex_0.3.0
## [94] digest_0.6.27 webshot_0.5.2 munsell_0.5.0
## [97] viridisLite_0.3.0